Spatial Data Analysis with Generalized Additive Models (GAMs) in R
Introduction
To illustrate the methodology, I will use the example on river blindness in Liberia from Model-Based Geostatistics for Global Public Health: Methods and Applications by Diggle and Giorgi and Model-based Geostatistics for Global Public Health using R by Giorgi and Fronterre.
River blindness in Liberia
The data considered in our example were collected as part of the Rapid Epidemiological Mapping of Onchocerciasis (REMO), an Africa-wide initiative conducted in 2011 across 20 countries (Zouré et al. 2014). The purpose of REMO was to identify areas where river blindness (onchocerciasis), a parasitic disease transmitted by black flies that breed along fast-flowing rivers, remains a public health problem. Communities with a prevalence of river blindness over 20% are prioritised for treatment.
For our example, we will model nodule prevalence in Liberia. Palpable nodules under the skin form around adult Onchocerca volvulus worms, the parasite responsible for river blindness. Because detecting nodules by physical examination is far less costly and logistically demanding than laboratory-based or molecular diagnostic methods, nodule prevalence is a practical and widely used proxy for infection in large-scale epidemiological surveys.
Using this dataset, I will demonstrate how to conduct spatial data analysis in R.
Handling Spatial Data in R
Downloading Administrative Boundaries
The geoBoundaries datasets, which are accessible via the rgeoboundaries package in R, provide openly available administrative boundary data for nearly every country.
The following code imports and displays the border of Liberia.
# Load the rgeoboundaries package
library(rgeoboundaries)
# Download administrative boundaries for Liberia (level 0: country)
liberia_admin0 <- gb_adm0("Liberia")
# Plot the administrative boundaries
plot(liberia_admin0$geometry)
The following code imports and displays the first-level administrative divisions of Liberia, which are called counties.
# Do the same for level 1: counties
liberia_admin1 <- gb_adm1("Liberia")
# Plot the administrative boundaries
plot(liberia_admin1$geometry)
The following code imports and displays the second-level administrative divisions of Liberia, which are called districts.
# Do the same for level 2: districts
liberia_admin2 <- gb_adm2("Liberia")
# Plot the administrative boundaries
plot(liberia_admin2$geometry)
Downloading Population Data
High-resolution population data for Liberia (and other nations) from WorldPop can be easily accessed using the wpgpDownloadR package. The WorldPop datasets offer gridded population estimates at various spatial resolutions.
Before downloading the data, we will search for available datasets for Liberia. The function wpgpListCountryDatasets() helps in retrieving a list of all available datasets for a specified country.
library(wpgpDownloadR)
Loading required package: RCurl
# Search for datasets available for Liberia using the ISO3 country code
lbr_datasets <- wpgpListCountryDatasets(ISO3 = "LBR")
lbr_datasets[1:20,]
      ISO ISO3 Country Covariate
125   430  LBR Liberia  ppp_2000
374   430  LBR Liberia  ppp_2001
623   430  LBR Liberia  ppp_2002
872   430  LBR Liberia  ppp_2003
1121  430  LBR Liberia  ppp_2004
1370  430  LBR Liberia  ppp_2005
1619  430  LBR Liberia  ppp_2006
1868  430  LBR Liberia  ppp_2007
2117  430  LBR Liberia  ppp_2008
2366  430  LBR Liberia  ppp_2009
2615  430  LBR Liberia  ppp_2010
2864  430  LBR Liberia  ppp_2011
3113  430  LBR Liberia  ppp_2012
3362  430  LBR Liberia  ppp_2013
3611  430  LBR Liberia  ppp_2014
3860  430  LBR Liberia  ppp_2015
4109  430  LBR Liberia  ppp_2016
4358  430  LBR Liberia  ppp_2017
4607  430  LBR Liberia  ppp_2018
4856  430  LBR Liberia  ppp_2019
Each row also carries a Description field, which for every year reads: "Estimated total number of people per grid-cell <year>. The dataset is available to download in Geotiff format at a resolution of 3 arc seconds (approximately 100m at the equator)."
Raster data consists of a grid of cells, where each cell holds a value representing a spatial attribute such as elevation or temperature. The terra package in R is designed to work with raster data. We can use the following code to download population data for Liberia for the year 2014 at a 100m resolution.
# Load the terra package
library(terra)
terra 1.8.70
# Raster file of population data for Liberia for the year 2014 at a 100m resolution
lbr_pop_url <- wpgpGetCountryDataset(ISO3 = "LBR", covariate = "ppp_2014")
# Import the raster file
lbr_pop_100 <- rast(lbr_pop_url)
# Inspect the raster data
print(lbr_pop_100)
class : SpatRaster
size : 5040, 4945, 1 (nrow, ncol, nlyr)
resolution : 0.0008333333, 0.0008333333 (x, y)
extent : -11.48625, -7.365417, 4.352084, 8.552084 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source : lbr_ppp_2014.tif
name : lbr_ppp_2014
min value : 0.006426936
max value : 92.716874581
Here’s a plot of the population using the original scale.
# Basic plot of the raster (original scale)
plot(lbr_pop_100, main = "Population per 10,000 m^2")
plot(liberia_admin0$geometry, add = TRUE)
Here’s the same plot on the log-transformed scale.
# Basic plot of the raster (log10 scale)
plot(log10(lbr_pop_100), main = "Population per 10,000 m^2 (log10)")
plot(liberia_admin0$geometry, add = TRUE)
Downloading Elevation Data
The elevatr package currently provides access to elevation data from the AWS Open Data Terrain Tiles and the OpenTopography Global Datasets API for raster digital elevation models. For point elevation data, the USGS Elevation Point Query Service may be used, or point elevations may be derived from the AWS tiles.
We can use the following code to download elevation data for Liberia.
# Download elevation data
library(elevatr)
elevatr v0.99.0 NOTE: Version 0.99.0 of 'elevatr' uses 'sf' and 'terra'. Use
of the 'sp', 'raster', and underlying 'rgdal' packages by 'elevatr' is being
deprecated; however, get_elev_raster continues to return a RasterLayer. This
will be dropped in future versions, so please plan accordingly.
liberia_elev <- get_elev_raster(locations = liberia_admin0,
                                z = 5, clip = "locations")
Mosaicing & Projecting
Clipping DEM to locations
Note: Elevation units are in meters.
liberia_elev <- as(liberia_elev, "SpatRaster")
Here’s a plot of the elevation data.
# Plot the result
plot(liberia_elev, main = "Elevation in Liberia (m)")
plot(liberia_admin0$geometry, add = TRUE)
Convert Data Frame to Spatial Object
In geospatial analysis, data is often provided in tabular formats like CSV files that contain spatial coordinates (e.g., latitude and longitude). To use these data effectively in R, it is necessary to convert the data frame into an sf object, which is the standard format for working with (vector) spatial data in R.
The following code loads the Liberia river blindness data from the RiskMap package and converts it into an sf object.
# Load the sf package
library(sf)
Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
# Load the RiskMap package
library(RiskMap)
# Load the Liberia data set
data("liberia")
# Convert the data frame to an sf object
liberia_sf <- st_as_sf(liberia,
coords = c("long", "lat"),
crs = 4326)
# Inspect the new sf object
liberia_sf
Simple feature collection with 90 features and 4 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -11.40335 ymin: 4.42552 xmax: -7.4928 ymax: 8.46911
Geodetic CRS: WGS 84
First 10 features:
ntest npos elevation log_elevation geometry
1 50 14 17.82621 2.880670 POINT (-10.4671 6.2878)
2 46 10 104.85070 4.652537 POINT (-10.45111 6.5725)
3 43 11 119.09543 4.779925 POINT (-10.02615 6.56667)
4 50 10 144.10921 4.970571 POINT (-10.28775 6.73333)
5 48 9 19.03508 2.946283 POINT (-10.03595 6.016667)
6 50 13 16.99954 2.833186 POINT (-10.335 6.266666)
7 43 9 137.75360 4.925467 POINT (-9.63333 6.13541)
8 50 11 101.75000 4.622519 POINT (-10.06875 6.2375)
9 47 11 147.12867 4.991308 POINT (-9.73333 6.3869)
10 43 0 26.61054 3.281308 POINT (-9.7856 5.7354)
Here is a default plot of the attributes for the Liberia river blindness data.
plot(liberia_sf)
The st_as_sf() function converts the data frame into an sf object. The coords argument specifies which columns contain the spatial coordinates, and the crs argument assigns the coordinate reference system (CRS). The code EPSG:4326 represents WGS84, the most commonly used geographic CRS, which uses latitude and longitude to describe locations on the Earth’s surface. It is the default CRS for global datasets and GPS systems.
The sf object can now be used for operations such as spatial joins, distance calculations, and mapping with other spatial layers. Note that the columns containing the spatial coordinates have been replaced by a geometry column, which now stores this information. If you would like to retain the original coordinate columns in the output, you can set the remove argument to FALSE when converting the data frame to an sf object.
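For example, retaining the coordinate columns only requires adding remove = FALSE to the conversion shown above (the object name liberia_sf_keep is an illustrative choice):

```r
# Convert to sf but keep the original long/lat columns
# alongside the new geometry column
liberia_sf_keep <- st_as_sf(liberia,
                            coords = c("long", "lat"),
                            crs = 4326,
                            remove = FALSE)
```

This is convenient when you later need the raw coordinates as plain numeric columns, for example as covariates in a regression model.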
Working with CRSs
When working with data from multiple sources, such as environmental layers, population data, or administrative boundaries, ensuring that all datasets share the same CRS is essential for accurate spatial analysis.
Checking the CRS
The st_crs() function retrieves the CRS for vector data (sf objects).
# Check the CRS of the vector data
st_crs(liberia_sf)
Coordinate Reference System:
User input: EPSG:4326
wkt:
GEOGCRS["WGS 84",
ENSEMBLE["World Geodetic System 1984 ensemble",
MEMBER["World Geodetic System 1984 (Transit)"],
MEMBER["World Geodetic System 1984 (G730)"],
MEMBER["World Geodetic System 1984 (G873)"],
MEMBER["World Geodetic System 1984 (G1150)"],
MEMBER["World Geodetic System 1984 (G1674)"],
MEMBER["World Geodetic System 1984 (G1762)"],
MEMBER["World Geodetic System 1984 (G2139)"],
MEMBER["World Geodetic System 1984 (G2296)"],
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ENSEMBLEACCURACY[2.0]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["geodetic latitude (Lat)",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["geodetic longitude (Lon)",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
USAGE[
SCOPE["Horizontal component of 3D system."],
AREA["World."],
BBOX[-90,-180,90,180]],
ID["EPSG",4326]]
The crs() function retrieves the CRS for raster data.
# Check the CRS of the raster data
crs(lbr_pop_100, proj = TRUE, describe = TRUE)
    name authority code  area             extent
1 WGS 84      EPSG 4326 World -180, 180, -90, 90
proj
1 +proj=longlat +datum=WGS84 +no_defs
Reprojecting Spatial Data to a Different CRS
For vector data, st_transform() reprojects the data into a specified CRS. This example transforms the liberia point data from WGS84 into UTM. To determine the correct UTM zone, and hence the EPSG code, for Liberia, we can use the propose_utm() function from the RiskMap package.
# Obtain EPSG code for UTM for Liberia
propose_utm(liberia_sf)
[1] 32629
Reproject the Liberia river blindness dataset to UTM.
# Reproject the vector data
liberia_sf_utm <- st_transform(liberia_sf, crs = propose_utm(liberia_sf))
Reprojecting raster data is more complex than reprojecting vector data because of the continuous nature of raster grids. When a raster is reprojected, its cell values must be recalculated to fit a new grid defined by the target CRS, which can introduce resampling artifacts, distortion, and data loss. The resampling method you choose depends on the data type: nearest neighbor is best for categorical data like land use, while bilinear or cubic interpolation is good for continuous data like temperature, where smooth transitions are needed.
Here we reproject the Liberia population raster to the same UTM as the reprojected Liberia river blindness dataset.
lbr_pop_100_utm <- project(lbr_pop_100,
crs(liberia_sf_utm),
method = "bilinear")
plot(lbr_pop_100_utm)
It is typically better to reproject vectors rather than rasters when both data types are used together, and to avoid changing the CRS of rasters as much as possible. One simple recommendation is to perform all spatial operations, such as extracting covariates from rasters, in WGS84, and then transform the point data to a projected CRS before fitting the model.
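To sketch the categorical case mentioned above: for a hypothetical land-cover raster (lbr_landcover is not part of this tutorial's data), the same project() call would use nearest-neighbor resampling so that class codes are never averaged:

```r
# Hypothetical categorical raster: reproject with nearest-neighbor
# resampling ("near") so that land-use class codes stay intact
lbr_landcover_utm <- project(lbr_landcover,
                             crs(liberia_sf_utm),
                             method = "near")
```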
Extracting Covariates
Covariate data often come from raster or polygon sources, and extracting values at point locations is essential for linking spatial context to point-referenced data. Here we show how to extract covariates at point locations from both polygon and raster layers.
A common task in spatial data analysis is to assign attributes from a polygon layer to a set of point-referenced data. This is done using a spatial join, which overlays the point and polygon layers and attaches the attributes of the enclosing polygon to each point.
In the following R code, we assign administrative region names (county, district) to the point locations in the Liberia river blindness dataset.
# Load dplyr for the pipe operator and rename()
library(dplyr)
# Perform a spatial join to transfer polygon attributes to the points
liberia_sf1 <- st_join(liberia_sf, liberia_admin1["shapeName"]) %>%
  dplyr::rename(county = shapeName) %>%
  st_join(liberia_admin2["shapeName"]) %>%
  dplyr::rename(district = shapeName)
# View the results, points now include covariates from the polygon layers
head(liberia_sf1)
Simple feature collection with 6 features and 6 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -10.4671 ymin: 6.016667 xmax: -10.02615 ymax: 6.73333
Geodetic CRS: WGS 84
ntest npos elevation log_elevation county district
1 50 14 17.82621 2.880670 Margibi Mambah Kaba
2 46 10 104.85070 4.652537 Montserrado Todee
3 43 11 119.09543 4.779925 Margibi Gibi
4 50 10 144.10921 4.970571 Margibi Kakata
5 48 9 19.03508 2.946283 Grand Bassa St. John River City
6 50 13 16.99954 2.833186 Grand Bassa Owensgrove
geometry
1 POINT (-10.4671 6.2878)
2 POINT (-10.45111 6.5725)
3 POINT (-10.02615 6.56667)
4 POINT (-10.28775 6.73333)
5 POINT (-10.03595 6.016667)
6 POINT (-10.335 6.266666)
Raster data provides continuous spatial information, such as elevation, climate data, or population density. Covariate values from raster layers can be extracted for specific points using the extract() function from the terra package. Each point will receive the value of the raster cell it overlaps.
# Extract raster values at the point locations
covariate_values <- terra::extract(lbr_pop_100, liberia_sf1)
# Combine the extracted values with the point data
liberia_sf1$pop_dens <- covariate_values[, 2]
# View the updated dataset
head(liberia_sf1)
Simple feature collection with 6 features and 7 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -10.4671 ymin: 6.016667 xmax: -10.02615 ymax: 6.73333
Geodetic CRS: WGS 84
ntest npos elevation log_elevation county district
1 50 14 17.82621 2.880670 Margibi Mambah Kaba
2 46 10 104.85070 4.652537 Montserrado Todee
3 43 11 119.09543 4.779925 Margibi Gibi
4 50 10 144.10921 4.970571 Margibi Kakata
5 48 9 19.03508 2.946283 Grand Bassa St. John River City
6 50 13 16.99954 2.833186 Grand Bassa Owensgrove
geometry pop_dens
1 POINT (-10.4671 6.2878) 0.2914019
2 POINT (-10.45111 6.5725) 0.6087928
3 POINT (-10.02615 6.56667) 0.2044306
4 POINT (-10.28775 6.73333) 0.5078438
5 POINT (-10.03595 6.016667) 0.1767686
6 POINT (-10.335 6.266666) 1.6693381
In this example, the extract() function assigns the raster value from the population density layer to each point in the dataset.
Instead of extracting values for exact point locations, it is often useful to aggregate covariate values within a defined area around each point.
For instance, you might want to calculate the average population density within a 5 km radius around each point to smooth out fine-scale variation.
# Create buffers around each point (e.g., 5 km radius)
buffered_points <- st_buffer(liberia_sf_utm, dist = 5000)
# Plot the buffers for visualization
plot(st_geometry(buffered_points), border = "black")
plot(st_geometry(liberia_sf_utm), add = TRUE, pch = 19, col = "red", cex = .2)
# Extract raster values within the buffer areas and calculate the
# mean or sum. Note that since we used the UTM data to work on the
# meter scale, we need to convert the buffers back to WGS84
mean_pop_density <- terra::extract(lbr_pop_100,
st_transform(buffered_points, crs = 4326),
fun = mean, na.rm = TRUE)
# Add the averaged values to the points dataset
liberia_sf1$pop_dens_5km <- mean_pop_density[,2]
# View the updated dataset
head(liberia_sf1)
Simple feature collection with 6 features and 8 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -10.4671 ymin: 6.016667 xmax: -10.02615 ymax: 6.73333
Geodetic CRS: WGS 84
ntest npos elevation log_elevation county district
1 50 14 17.82621 2.880670 Margibi Mambah Kaba
2 46 10 104.85070 4.652537 Montserrado Todee
3 43 11 119.09543 4.779925 Margibi Gibi
4 50 10 144.10921 4.970571 Margibi Kakata
5 48 9 19.03508 2.946283 Grand Bassa St. John River City
6 50 13 16.99954 2.833186 Grand Bassa Owensgrove
geometry pop_dens pop_dens_5km
1 POINT (-10.4671 6.2878) 0.2914019 0.6015397
2 POINT (-10.45111 6.5725) 0.6087928 0.5565529
3 POINT (-10.02615 6.56667) 0.2044306 0.2488039
4 POINT (-10.28775 6.73333) 0.5078438 0.5604327
5 POINT (-10.03595 6.016667) 0.1767686 0.2877019
6 POINT (-10.335 6.266666) 1.6693381 2.0974484
Creating a Predictive Grid
A predictive grid is a regularly spaced set of points or cells that spans the study region. This grid serves as the basis for predictions made by your model. The density of the grid (i.e., the distance between grid points) affects both the resolution of the prediction and the computational cost.
Here we can generate a grid of points at a 5 km resolution across Liberia using the create_grid() function in the RiskMap package.
# 1) Create the grid at 5 km resolution
liberia_grid <- create_grid(st_transform(liberia_admin0, crs = 32629),
                            spat_res = 5) %>%
  st_as_sf() %>%
  st_set_geometry("geometry") %>%
  st_transform(crs = 4326)
# View the updated dataset
head(liberia_grid)
Simple feature collection with 6 features and 0 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -7.817525 ymin: 4.376494 xmax: -7.547283 ymax: 4.422194
Geodetic CRS: WGS 84
geometry
1 POINT (-7.682436 4.376744)
2 POINT (-7.637384 4.376663)
3 POINT (-7.592333 4.37658)
4 POINT (-7.547283 4.376494)
5 POINT (-7.817525 4.422194)
6 POINT (-7.772468 4.422121)
The following code adds covariates (county, district, elevation, population density) to the predictive grid.
# 2) Assign region names (county, district) to grid points
liberia_grid <- st_join(liberia_grid, liberia_admin1["shapeName"]) %>% dplyr::rename(county=shapeName) %>%
st_join(liberia_admin2["shapeName"]) %>% dplyr::rename(district=shapeName)
# 3) Extract elevation from raster at grid points
## Download elevation data
#library(elevatr)
#liberia_elev <- get_elev_raster(locations = liberia_admin0,
# z = 5, clip = "locations")
#liberia_elev <- as(liberia_elev, "SpatRaster")
# Extract raster values at the point locations
elevation_values <- terra::extract(liberia_elev, liberia_grid)
# Combine the extracted values with the point data
liberia_grid$elevation <- elevation_values[, 2]
# 4) Obtain population density in 5 km buffer around grid points
# Create buffers around each point (e.g., 5 km radius)
buffered_points <- st_transform(st_buffer(st_transform(liberia_grid, crs = 32629), dist = 5000), crs = 4326)
# Extract raster values within the buffer areas and calculate the
# mean or sum. Note that since we used the utm data to work on the
# meter scale we need to convert them back to WGS84
mean_pop_density <- terra::extract(lbr_pop_100,
st_transform(buffered_points, crs = 4326),
fun = mean, na.rm = TRUE)
# Add the averaged values to the points dataset
liberia_grid$pop_dens_5km <- mean_pop_density[,2]
# View the updated dataset
head(liberia_grid)
Simple feature collection with 6 features and 4 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -7.817525 ymin: 4.376494 xmax: -7.547283 ymax: 4.422194
Geodetic CRS: WGS 84
county district geometry elevation pop_dens_5km
1 Maryland Harper POINT (-7.682436 4.376744) 12 2.3635802
2 Maryland Harper POINT (-7.637384 4.376663) 16 1.6534593
3 Maryland Harper POINT (-7.592333 4.37658) 12 1.3063504
4 Maryland Harper POINT (-7.547283 4.376494) 13 1.2204457
5 Maryland Harper POINT (-7.817525 4.422194) 5 0.9524127
6 Maryland Harper POINT (-7.772468 4.422121) 17 1.1257582
Final Preparations for Analysis
Spatial statistical analysis methods typically assume Cartesian coordinates rather than latitude/longitude, so we transform the analysis dataset and predictive grid to the appropriate UTM. In order to use the coordinates as covariates in regression models, we add them as attributes in the sf objects.
Convert Analysis Dataset and Predictive Grid to UTM Coordinates
liberia_sf1_utm <- st_transform(liberia_sf1, crs = 32629)
liberia_grid_utm <- st_transform(liberia_grid, crs = 32629)
Add Grid Coordinates as Attributes
liberia_sf1_utm <- liberia_sf1_utm %>% dplyr::mutate(x = st_coordinates(.)[,1],
y = st_coordinates(.)[,2])
liberia_grid_utm <- liberia_grid_utm %>% dplyr::mutate(x = st_coordinates(.)[,1],
y = st_coordinates(.)[,2])
Generalized Additive Models
We will now proceed to build a regression model for the Liberia river blindness data. The outcome variable is nodule prevalence. The predictor variables are elevation (in meters), population density (per 10,000 m^2) within a 5 km buffer, and the geographic location (in meters easting and meters northing).
For modeling, we will transform elevation and population density using log (base 2) so that a “linear” effect would reflect a constant impact of doubling the value of the covariate.
liberia_sf1_utm <- liberia_sf1_utm %>% dplyr::mutate(log2_elevation = log2(elevation),
log2_pop_dens_5km = log2(pop_dens_5km))
liberia_grid_utm <- liberia_grid_utm %>% dplyr::mutate(log2_elevation = log2(elevation),
log2_pop_dens_5km = log2(pop_dens_5km))
Generalized additive models (GAMs) are generalized linear models in which the effects of covariates are represented by smooth functions. Conceptually, this is very similar to the (fixed-df) natural cubic spline approach discussed in other courses. The most pertinent difference is that a GAM uses penalization to control the wiggliness of the smooth functions, rather than relying on pre-specifying the appropriate number of knots (df).
We will use the mgcv package to fit these models.
There are many different options for specifying smooth terms in a GAM. The default choice of smoother is a thin plate regression spline (bs=“tp”). The technical virtues of thin plate regression splines are beyond the scope of this course. Other choices for the smoother include the more familiar cubic regression splines (bs=“cr”) among many other options. There is an extensive description of the options available in the help file for smooth.terms (link) in the mgcv package.
The following code fits a logistic regression model for nodule prevalence as a function of log2(elevation) and log2(population density). For illustration, we will walk through some of the options for specifying the smooths. The first model uses cubic regression splines with 4 df (fixed) as the smooth for each variable.
library(mgcv)
Loading required package: nlme
This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
## binomial outcome y~binomial(n,p) specify outcome as cbind(y,n-y)
## cubic regression spline bs="cr", k=5 knots, fx=TRUE (no penalization) s(x, bs="cr", k=5, fx=TRUE)
nodule_fit1a <- gam(cbind(npos,ntest-npos) ~ s(log2_elevation, bs="cr", k=5, fx=TRUE) + s(log2_pop_dens_5km, bs="cr", k=5, fx=TRUE),
family="binomial", data=liberia_sf1_utm)
summary(nodule_fit1a)
Family: binomial
Link function: logit
Formula:
cbind(npos, ntest - npos) ~ s(log2_elevation, bs = "cr", k = 5,
fx = TRUE) + s(log2_pop_dens_5km, bs = "cr", k = 5, fx = TRUE)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.50360 0.04126 -36.44 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(log2_elevation) 4 4 64.89 2.71e-13 ***
s(log2_pop_dens_5km) 4 4 10.63 0.031 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.335 Deviance explained = 36.3%
UBRE = 0.80503 Scale est. = 1 n = 90
plot(nodule_fit1a, pages=1)
The following code fits the same model using cubic regression splines with the default choices for the df (basis dimension) of the spline and penalization.
## binomial outcome y~binomial(n,p) specify outcome as cbind(y,n-y)
## cubic regression spline bs="cr", automatic df, penalization s(x, bs="cr") = s(x, bs="cr", k=-1, fx=FALSE)
nodule_fit1b <- gam(cbind(npos,ntest-npos) ~ s(log2_elevation, bs="cr") + s(log2_pop_dens_5km, bs="cr"),
family="binomial", data=liberia_sf1_utm)
summary(nodule_fit1b)
Family: binomial
Link function: logit
Formula:
cbind(npos, ntest - npos) ~ s(log2_elevation, bs = "cr") + s(log2_pop_dens_5km,
bs = "cr")
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.49789 0.04086 -36.66 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(log2_elevation) 2.261 2.838 64.888 <2e-16 ***
s(log2_pop_dens_5km) 3.699 4.531 8.607 0.0941 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.349 Deviance explained = 36.3%
UBRE = 0.75985 Scale est. = 1 n = 90
plot(nodule_fit1b, pages=1)
The following code fits the same model using the default thin plate regression spline.
## binomial outcome y~binomial(n,p) specify outcome as cbind(y,n-y)
## thin plate regression spline, automatic df, penalization s(x) = s(x, bs="tp", k=-1, fx=FALSE)
nodule_fit1 <- gam(cbind(npos,ntest-npos) ~ s(log2_elevation) + s(log2_pop_dens_5km),
family="binomial", data=liberia_sf1_utm)
summary(nodule_fit1)
Family: binomial
Link function: logit
Formula:
cbind(npos, ntest - npos) ~ s(log2_elevation) + s(log2_pop_dens_5km)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.49900 0.04093 -36.62 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(log2_elevation) 2.496 3.135 64.710 <2e-16 ***
s(log2_pop_dens_5km) 3.858 4.791 9.275 0.0908 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.351 Deviance explained = 36.9%
UBRE = 0.755 Scale est. = 1 n = 90
plot(nodule_fit1, pages=1)
Note that the estimated smooths and effective df are essentially identical regardless of the choice of basis functions (cubic regression splines or thin plate regression splines).
The assumptions underlying GAMs are, to a large extent, no different from those made for other regression models. For the most part, you would assess the adequacy of the assumptions (in this case, correct model structure and independence) using the same approach as for any other logistic regression. In terms of the functional form, while the exact choice of k-1 (or k), the upper limit on the df for the smooth, is generally not critical, it is important that it is large enough to represent the “true” function reasonably well. The gam.check() function runs a simple simulation-based check on the basis dimensions, which can flag terms for which k is too low.
gam.check(nodule_fit1)
Method: UBRE Optimizer: outer newton
full convergence after 3 iterations.
Gradient range [-8.125569e-08,1.395182e-06]
(score 0.7549974 & scale 1).
Hessian positive definite, eigenvalue range [0.005798449,0.006220626].
Model rank = 19 / 19
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(log2_elevation) 9.00 2.50 0.87 0.14
s(log2_pop_dens_5km) 9.00 3.86 1.13 0.88
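Had gam.check() flagged a term (low k-index with edf close to k'), the usual remedy is to refit with a larger basis dimension and re-check. A sketch, reusing the model above (the value k = 20 is an illustrative choice, not a recommendation):

```r
# Sketch: refit with a larger basis dimension for the flagged term,
# then re-run the basis-dimension check
nodule_fit1_k20 <- gam(cbind(npos, ntest - npos) ~
                         s(log2_elevation, k = 20) +
                         s(log2_pop_dens_5km),
                       family = "binomial", data = liberia_sf1_utm)
gam.check(nodule_fit1_k20)
```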
Incorporating spatial location as a covariate in a GAM is straightforward: we simply define a two-dimensional smooth function of the spatial coordinates (x, y). There are two main options for specifying multi-dimensional smooth functions: (isotropic) thin plate regression splines and tensor-product splines. Tensor-product smooths are recommended when the covariates in a smooth are not naturally on the same scale (e.g., interactions between two arbitrary quantitative covariates). For spatial data, thin plate regression splines are a reasonable default choice.
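For comparison with the isotropic thin plate smooth s(x, y) used in this example, a tensor-product version of the same model would replace that term with te(x, y). This is only a sketch of the alternative specification, not the model fitted here:

```r
# Sketch: tensor-product spatial smooth as an alternative to s(x, y);
# appropriate when the two coordinates are not on a common scale
nodule_fit2_te <- gam(cbind(npos, ntest - npos) ~
                        s(log2_elevation) +
                        s(log2_pop_dens_5km) +
                        te(x, y),
                      family = "binomial", data = liberia_sf1_utm)
```

Because both easting and northing are in meters here, the isotropic smooth is the natural choice and the two fits would be expected to be similar.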
## binomial outcome y~binomial(n,p) specify outcome as cbind(y,n-y)
## thin plate regression spline, automatic df, penalization s(x) = s(x, bs="tp", k=-1, fx=FALSE)
## thin plate regression spline for spatial location s(x,y)
nodule_fit2 <- gam(cbind(npos,ntest-npos) ~ s(log2_elevation) + s(log2_pop_dens_5km) + s(x,y),
family="binomial", data=liberia_sf1_utm)
summary(nodule_fit2)
Family: binomial
Link function: logit
Formula:
cbind(npos, ntest - npos) ~ s(log2_elevation) + s(log2_pop_dens_5km) +
s(x, y)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.55153 0.04335 -35.8 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(log2_elevation) 3.945 4.951 7.742 0.167
s(log2_pop_dens_5km) 1.000 1.000 0.541 0.462
s(x,y) 23.386 27.043 76.276 1.25e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.673 Deviance explained = 76.4%
UBRE = 0.24579 Scale est. = 1 n = 90
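For comparison, the tensor-product alternative mentioned above would replace s(x,y) with te(x,y). A minimal sketch on simulated spatial binomial data (illustration only, not the Liberia data; the simulated surface and sample sizes are assumptions for the example):

```r
library(mgcv)
set.seed(1)
# Simulated spatial binomial data on the unit square
n <- 300
x <- runif(n); y <- runif(n)
p <- plogis(-1 + 1.5 * sin(3 * x) * cos(3 * y))  # smooth surface on the logit scale
ntest <- rep(50, n)                              # examinees per location
npos <- rbinom(n, ntest, p)                      # positives per location
fit_tp <- gam(cbind(npos, ntest - npos) ~ s(x, y),  family = binomial)  # isotropic thin plate
fit_te <- gam(cbind(npos, ntest - npos) ~ te(x, y), family = binomial)  # tensor product
AIC(fit_tp, fit_te)  # with coordinates on a common scale, the two fits are often similar
```

When the coordinates are in a common projected unit (as with our UTM easting and northing), the isotropic s(x,y) is the natural choice; te(x,y) becomes preferable if the axes are on incommensurate scales.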
plot(nodule_fit2)
In mgcv’s plot() command, two-dimensional smooths are represented with contour plots. In this plot, the axes represent the spatial location, x (meters easting) and y (meters northing). The interior is a topographic map of predicted values. The contour lines represent points of equal predicted values, and they are labeled. The dotted lines show uncertainty in prediction; they represent how the contour lines would move if predictions were one standard error higher or lower.
A contour plot is not always the most intuitive way of plotting interactions, so mgcv offers a couple more options. Setting the scheme argument of plot() to 1 produces a 3D perspective plot instead.
#Plot the surface
plot(nodule_fit2, select=3, scheme=1)
Setting scheme to 2 generates a heat map. This is a simplified contour map on top of which colors are added. The yellow colors represent larger predictions and the red colors smaller ones.
#Plot the surface
plot(nodule_fit2, select=3, scheme=2)
Customizing interaction plots with vis.gam()
The first argument to vis.gam(), x, is the GAM model. The second, view, is where you list which variables in the model you want to visualize jointly.
Setting the plot.type argument to “persp” will produce a 3D perspective plot as shown here.
In this case, x and y are interacting variables, but they do not need to be. You can view a perspective plot of any two variables in the model to see their combined effects.
vis.gam(x = nodule_fit2, #GAM model fit
view = c("x","y"), #variables
plot.type = "persp") #type of plot
Or you can set plot.type to “contour”, which will produce a contour plot or heat map.
vis.gam(x = nodule_fit2, #GAM model fit
view = c("x","y"), #variables
plot.type = "contour") #type of plot
The too.far argument is an important one when using these plots to inspect your model. too.far indicates which predictions should not be plotted because they are too far from the actual data. In other words, how far is too far to extrapolate? Setting this value lets you see which combinations of variables are not represented in your data and therefore might not yield good predictions from your model.
too.far is scaled from zero to one along the range of the variables. Here, we set it to 0.1 for the left plot and 0.05 for the right. On the left, allowing 10% extrapolation fills in most of the surface. On the right, allowing only 5% extrapolation reveals more areas that are not supported by data.
par(mfrow=c(1,2))
vis.gam(x = nodule_fit2, #GAM model fit
view = c("x","y"), #variables
plot.type = "contour", #type of plot
too.far = 0.1) #limit of extrapolation
vis.gam(x = nodule_fit2, #GAM model fit
view = c("x","y"), #variables
plot.type = "contour", #type of plot
too.far = 0.05) #limit of extrapolation
Predicting River Blindness Prevalence in Liberia
Using the predict.gam() function, we can obtain predictions (estimates) of the prevalence of river blindness across Liberia on our 5 km grid of locations. By default, predictions are produced on the logit (linear predictor) scale.
liberia_grid_utm$logit_prev <- predict(nodule_fit2,liberia_grid_utm) #predictions on the logit scale
liberia_grid_utm$prev <- predict(nodule_fit2,liberia_grid_utm,type="response") #predictions on the prevalence scale
plot(liberia_grid_utm["prev"],pch=15,main="Prevalence")
We can observe the importance of accounting for spatial dependence by examining similar predictions (estimates) for the original model with just elevation and population density.
liberia_grid_utm$logit_prev <- predict(nodule_fit1,liberia_grid_utm) #predictions on the logit scale
liberia_grid_utm$prev <- predict(nodule_fit1,liberia_grid_utm,type="response") #predictions on the prevalence scale
plot(liberia_grid_utm["prev"],pch=15,main="Prevalence")
Predictions from GAMs can be accompanied by standard errors based on the posterior distribution of the model coefficients. We will use the standard errors to estimate the (posterior) probability that the prevalence exceeds 20% at each location.
liberia_grid_utm$logit_prev <- predict(nodule_fit2,liberia_grid_utm) #predictions on the logit scale
liberia_grid_utm$se_logit_prev <- predict(nodule_fit2,liberia_grid_utm,se.fit=TRUE)$se.fit #SEs for predictions on the logit scale
liberia_grid_utm <- liberia_grid_utm %>% dplyr::mutate(exceed20 = pnorm(log(.2/.8),logit_prev,se_logit_prev,lower.tail=FALSE))
plot(liberia_grid_utm["exceed20"],pch=15,main="Exceedance Probability")
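To make the exceedance calculation concrete, here is a worked example with hypothetical numbers (not taken from the Liberia grid): suppose a grid location has a predicted logit prevalence of -1.2 with standard error 0.3.

```r
logit_prev <- -1.2     # hypothetical predicted logit prevalence
se_logit_prev <- 0.3   # hypothetical standard error
plogis(logit_prev)     # point estimate of prevalence, about 0.23
threshold <- qlogis(0.2)  # log(0.2/0.8) = -1.386, the 20% threshold on the logit scale
# P(true logit prevalence > threshold) under a normal posterior approximation
exceed20 <- pnorm(threshold, mean = logit_prev, sd = se_logit_prev, lower.tail = FALSE)
exceed20               # about 0.73: likely, but not certainly, above 20%
```

Even though the point estimate (about 23%) is above the 20% treatment threshold, the exceedance probability of roughly 0.73 quantifies how confident we should be in that conclusion given the prediction uncertainty.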